摘要

使用KaggleTitanic訓練及測試資料進行資料整理、分析、遺漏值填補以及建立預測模型,最後進行資料預測並提交結果,得到的分數為0.79426。

1.匯入資料

讀取kaggle_titanic_train.csv與kaggle_titanic_test.csv。觀察資料結構,發現Age、Fare與Embarked含有遺漏值。

train <- read.csv('kaggle_titanic_train.csv')
test  <- read.csv('kaggle_titanic_test.csv')
summary(train)
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   female:314   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   male  :577   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked
##             :687    :  2   
##  B96 B98    :  4   C:168   
##  C23 C25 C27:  4   Q: 77   
##  G6         :  4   S:644   
##  C22 C26    :  3           
##  D          :  3           
##  (Other)    :186
summary(test)
##   PassengerId         Pclass     
##  Min.   : 892.0   Min.   :1.000  
##  1st Qu.: 996.2   1st Qu.:1.000  
##  Median :1100.5   Median :3.000  
##  Mean   :1100.5   Mean   :2.266  
##  3rd Qu.:1204.8   3rd Qu.:3.000  
##  Max.   :1309.0   Max.   :3.000  
##                                  
##                                         Name         Sex     
##  Abbott, Master. Eugene Joseph            :  1   female:152  
##  Abelseth, Miss. Karen Marie              :  1   male  :266  
##  Abelseth, Mr. Olaus Jorgensen            :  1               
##  Abrahamsson, Mr. Abraham August Johannes :  1               
##  Abrahim, Mrs. Joseph (Sophie Halaut Easu):  1               
##  Aks, Master. Philip Frank                :  1               
##  (Other)                                  :412               
##       Age            SibSp            Parch             Ticket   
##  Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   PC 17608:  5  
##  1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   113503  :  4  
##  Median :27.00   Median :0.0000   Median :0.0000   CA. 2343:  4  
##  Mean   :30.27   Mean   :0.4474   Mean   :0.3923   16966   :  3  
##  3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000   220845  :  3  
##  Max.   :76.00   Max.   :8.0000   Max.   :9.0000   347077  :  3  
##  NA's   :86                                        (Other) :396  
##       Fare                     Cabin     Embarked
##  Min.   :  0.000                  :327   C:102   
##  1st Qu.:  7.896   B57 B59 B63 B66:  3   Q: 46   
##  Median : 14.454   A34            :  2   S:270   
##  Mean   : 35.627   B45            :  2           
##  3rd Qu.: 31.500   C101           :  2           
##  Max.   :512.329   C116           :  2           
##  NA's   :1         (Other)        : 80

將兩個資料合併成一個名為titanic的dataframe,以利後續分析運用。

titanic  <- bind_rows(train, test)

2.填補遺漏值

2.1 Missing values in “Embarked”

觀察登船港口遺漏資料,發現Fare都是80元,Passenger Class是1。

embark_missing <- titanic %>%
  filter(Embarked == "")
head(embark_missing)
##   PassengerId Survived Pclass                                      Name
## 1          62        1      1                       Icard, Miss. Amelie
## 2         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn)
##      Sex Age SibSp Parch Ticket Fare Cabin Embarked
## 1 female  38     0     0 113572   80   B28         
## 2 female  62     0     0 113572   80   B28

過濾掉Embarked遺漏值,以登船港口和乘客等級分別繪製票價盒鬚圖

embark_fare <- titanic %>%
  filter(Embarked != "")
ggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) +
  geom_boxplot() +
  scale_fill_brewer(palette = 7 , type = "div") +
  ggtitle("登船港口和乘客等級之票價盒鬚圖")

發現票價80元與C港口高階乘客票價中位數差不多,故將Embarked兩個遺漏值填補為C。

titanic$Embarked[titanic$Embarked == ""] <- "C"

2.2 Missing values in “Fare”

挑出遺漏Fare的該筆資料來觀看。

fare_missing <- subset(titanic,is.na(titanic$Fare))
##      PassengerId Survived Pclass               Name  Sex  Age SibSp Parch
## 1044        1044       NA      3 Storey, Mr. Thomas male 60.5     0     0
##      Ticket Fare Cabin Embarked
## 1044   3701   NA              S

Pclass == ‘3’,Embarked == ‘s’,觀察符合這些條件的票價情況。

third_s <- titanic[titanic$Pclass == '3' & titanic$Embarked == 'S', ]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   7.854   8.050  14.435  15.900  69.550       1

這些條件下Fare最多落在8左右,與其中位數差不多,故以此填補Fare遺漏值。

titanic$Fare[is.na(titanic$Fare)] <- median(third_s$Fare, na.rm = TRUE)

2.3 Missing values in “Age”

首先排除不太相關的變數,保留有用的變數,也能避免程式跑太久。

mice_data <- titanic[, !names(titanic) %in% c('PassengerId','Name','Ticket','Cabin','Survived')]

使用mice套件處理Age的遺漏值,method使用Random Forest。 再將填補完成的結果輸出,並將原本的Age欄位替換成填補好的資料。

mice_result <- mice(mice_data, method='rf',seed = 9487)
mice_output <- complete(mice_result)
titanic$Age <- mice_output$Age

3.資料視覺化

運用填補好各個遺漏值的資料來進行視覺化圖表繪製與觀察。首先要將填補好遺漏值的titanic資料還原成train與test資料。以具備Survived值的train data進行圖表繪製。   

train <- titanic[1:891,]

以下幾點觀察發現:

  • 整體而言三種階級以及各年齡層中,女性的生存率比男性高。
  • 各階級中,中上層階級的女性更具有生存優勢。
  • 各年齡層中,孩童的生存率較高,除了上層階級的女童部分。

4.建立模型及預測

分別使用Random Forest及Classification And Regression Tree來建立預測模型。

train <- titanic[1:891,]
test <- titanic[892:1309,]
set.seed(9487)
rf_fit <- randomForest(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked , data = train, ntree = 200)
cart_fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, data = train, method = "class")

進行測試資料預測。

rf_prediction <- predict(rf_fit, newdata = test[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")])
cart_prediction <- predict(cart_fit, newdata = test[, c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked")],type = "class")

輸出符合上傳格式的csv。

to_submit_rf <- data.frame(test[, "PassengerId"], rf_prediction)
names(to_submit_rf) <- c("PassengerId", "Survived")
write.csv(to_submit_rf,file="to_be_submitted_rf.csv",row.names = FALSE)
to_submit_cart <- data.frame(test[, "PassengerId"], cart_prediction)
names(to_submit_cart) <- c("PassengerId", "Survived")
write.csv(to_submit_cart,file="to_be_submitted_cart.csv",row.names = FALSE)

5.提交至Kaggle

使用隨機森林預測模型的結果分數為0.76555。

使用決策樹預測模型的結果分數為0.79426